<!DOCTYPE html>
<html lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1.0"/>
  <title>MiVOS Project Website</title>

  <!-- CSS  -->
  <link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet">
  <link href="css/materialize.css" type="text/css" rel="stylesheet" media="screen,projection"/>
  <link href="css/style.css" type="text/css" rel="stylesheet" media="screen,projection"/>
</head>
<body>
  <nav class="blue lighten-2" role="navigation">
    <div class="nav-wrapper container">
    </div>
  </nav>
  <div class="section no-pad-bot" id="index-banner">
    <div class="container">
      <br><br>
      <h5 class="header center blue-text text-darken-2">Modular Interactive Video Object Segmentation: <br> Interaction-to-Mask, Propagation and Difference-Aware Fusion</h5>
      <!-- <h6 class="header center"><i>Under Review</i></h6> -->
      <div class="row center">
        <h5 class="header col s12 light">Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang</h5>
      </div>
      <div class="row center">
        <div class="col s12 center">
          <a href="https://github.com/hkchengrex/MiVOS" class="larger-text">[Code, Models and Results]</a>
          &nbsp;&nbsp;
          <a href="dataset.html" class="larger-text">[Dataset: BL30K]</a>
          &nbsp;&nbsp;
          <a href="video.html" class="larger-text">[Video Results]</a>
          &nbsp;&nbsp;
          <a href="https://arxiv.org/abs/2103.07941" class="larger-text">[arXiv]</a>
          &nbsp;&nbsp;
          <a href="https://arxiv.org/pdf/2103.07941.pdf" class="larger-text">[PDF]</a>
        </div>
      </div>
    </div>
  </div>

  <div class="container">
    <div class="section">

    <div class="row center">
      <div class="col s12 m4 l4 center">
      <figure>
        <video preload="auto" autoplay muted loop class="videoInsert">
          <source src="https://i.imgur.com/zLAJmot.mp4" type="video/mp4">
          Your browser does not support the video tag.
        </video>
      <figcaption><i>motocross-jump, DAVIS 2017 validation set.</i></figcaption>
      </figure>
      </div>
      
      <div class="col s12 m4 l4 center">
      <figure>
        <video preload="auto" autoplay muted loop class="videoInsert">
          <source src="https://i.imgur.com/rrK3SNj.mp4" type="video/mp4">
          Your browser does not support the video tag.
        </video>
      <figcaption><i> Academy of Historical Fencing. <a href="https://youtu.be/966ulgwEcyc">[Source]</a></i></figcaption>
      </figure>
      </div>

      <div class="col s12 m4 l4 center">
      <figure>        
        <video preload="auto" autoplay muted loop class="videoInsert">
          <source src="https://i.imgur.com/XfuChcZ.mp4" type="video/mp4">
          Your browser does not support the video tag.
        </video>
        <figcaption><i>Modern History TV. <a href="https://youtu.be/e_D1ZQ7Hu0g">[Source]</a></i></figcaption>
      </figure>
      </div>
    </div>

      <div class="row">
        <div class="col s12 l10 push-l1 xl8 push-xl2">
          <img class="materialboxed" width="100%" src="https://imgur.com/Iw9uOx5.jpg" alt="Overview figure of the MiVOS framework">
        </div>
      </div>

    </div>

  <div class="divider"></div>
    <div class="section">
      <h5 class="header center">Abstract</h5>
        <div class="row">
          <div class="col s12 l10 push-l1 xl8 push-xl2">
            <p style="text-align: justify;">
              We present the Modular interactive VOS (MiVOS) framework, which decouples interaction-to-mask and mask propagation, allowing for higher generalizability and better performance. Trained separately, the interaction module converts user interactions into an object mask, which is then temporally propagated by our
              propagation module using a novel top-<i>k</i> filtering strategy for reading the space-time memory. To effectively take the user's intent into account, a novel difference-aware module is proposed to learn how to properly fuse the masks before and after each interaction, which are aligned with the target frames by employing the space-time memory.
              We evaluate our method both qualitatively and quantitatively with different forms of user interactions (e.g., scribbles, clicks) on DAVIS to show that our method outperforms current state-of-the-art algorithms while requiring fewer frame interactions, with the additional advantage of generalizing to different types of user interactions.
              We contribute a large-scale synthetic VOS dataset with pixel-accurate segmentation of 4.8M frames, which accompanies our source code to facilitate future research.
            </p>
          </div>
        </div>
    </div>
  </div>

  <footer class="page-footer blue">
    <div class="container">
      <div class="row">
        <div class="col l6 s12">
            <h5 class="white-text">Contact</h5>
            <p class="grey-text text-lighten-4">
                Ho Kei Cheng (<a href = "mailto: hkchengrex@gmail.com" style="color:#a6d9fc;">hkchengrex@gmail.com</a>)
            </p>
        </div>
      </div>
    </div>
    <div class="footer-copyright blue">
    </div>
  </footer>


  <!--  Scripts-->
  <script src="https://code.jquery.com/jquery-2.1.1.min.js"></script>
  <script src="js/materialize.js"></script>
  <script src="js/init.js"></script>

  <script>
    document.addEventListener('DOMContentLoaded', function () {
      // Initialize Materialize lightboxes for all .materialboxed images.
      var elems = document.querySelectorAll('.materialboxed');
      M.Materialbox.init(elems, {});
    });
  </script>
  </body>
</html>